367 research outputs found
Early register release for out-of-order processors with register windows
Register windows are an architectural technique that reduces the memory operations required to save and restore registers across procedure calls. Their effectiveness depends on the size of the register file. Register requirements normally increase for out-of-order execution, which needs registers for in-flight instructions in addition to the architectural ones. However, a large register file carries a significant cost in area and power and may even affect the cycle time. In this paper we propose two early register release techniques that leverage register windows to drastically reduce the register requirements, and hence the register file cost. Contrary to the common belief that out-of-order processors with register windows need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (the minimum needed to guarantee forward progress) and still achieve almost the same performance as an unbounded register file. Peer Reviewed. Postprint (published version).
Leveraging register windows to reduce physical registers to the bare minimum
Register windows are an architectural technique that reduces the memory operations required to save and restore registers across procedure calls. Their effectiveness depends on the size of the register file. Register requirements normally increase for out-of-order execution, which needs registers for in-flight instructions in addition to the architectural ones. However, a large register file carries a significant cost in area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverages register windows to drastically reduce the register requirements, and hence the register file cost. Contrary to the common belief that out-of-order processors with register windows need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (the minimum needed to guarantee forward progress) and still achieve almost the same performance as an unbounded register file. Peer Reviewed. Postprint (published version).
Time-predictable parallel programming models
Embedded Computing (EC) systems are increasingly concerned with providing higher performance in real time, while HPC applications require huge amounts of information to be processed within a bounded amount of time. Addressing this convergence and mixed set of requirements calls for suitable programming methodologies that exploit the massively parallel computation capabilities of the available platforms in a predictable way. OpenMP has evolved to deal with the programmability of heterogeneous many-cores, with mature support for fine-grained task parallelism. Unfortunately, while these features are very relevant for heterogeneous EC systems, often modeled as periodic task graphs, both the OpenMP programming interface and the execution model are completely agnostic to any timing requirement that the target applications may have. The goal of our work is to enable the use of the OpenMP parallel programming model in real-time embedded systems, so that many-core architectures can be adopted in critical real-time embedded systems. To do so, the timing behavior of OpenMP applications must be guaranteed.
Techniques for reducing and bounding OpenMP dynamic memory
OpenMP offers a tasking model that is very convenient
for developing critical real-time parallel applications by virtue of
its time predictability. However, current implementations make
intensive use of dynamic memory to efficiently manage the
parallel execution. This jeopardizes the qualification process
and limits the use of OpenMP in architectures with a limited
amount of memory. This work introduces an OpenMP framework
that statically allocates the data structures needed to efficiently
manage parallel execution in OpenMP programs. We achieve the
same performance as current implementations, while bounding
and reducing the dynamic memory requirements at runtime.
Film forum: "Happiness in the Seventh Art"
Paper presented at the course "La felicidad humana" (Human Happiness), part of the UBU summer courses 201
Modeling high-performance wormhole NoCs for critical real-time embedded systems
Manycore chips are a promising computing platform to cope with the increasing performance needs of critical real-time embedded systems (CRTES). However, manycore adoption by the CRTES industry requires understanding tasks' timing behavior when their requests use the manycore's network-on-chip (NoC) to access shared hardware resources. This paper analyzes contention in wormhole-based NoC (wNoC) designs, which are widely implemented in the high-performance domain. For these designs we introduce a new metric, worst-contention delay (WCD), that captures the wNoC impact on worst-case execution time (WCET) more tightly than the existing metric, worst-case traversal time (WCTT). Moreover, we provide an analytical model of the WCD that requests can suffer in a wNoC, and we validate it against wNoC designs resembling those in the Tilera-Gx36 and the Intel-SCC 48-core processors. Building on our WCD analytical model, we analyze the impact on WCD of different design parameters, such as the number of virtual channels, and we make a set of recommendations on which wNoC setups to use in the context of CRTES. Peer Reviewed. Postprint (author's final draft).
A static scheduling approach to enable safety-critical OpenMP applications
Parallel computation is fundamental to satisfying the performance requirements of advanced safety-critical systems. OpenMP is a good candidate for exploiting the performance opportunities of parallel platforms. However, safety-critical systems are often based on static allocation strategies, whereas current OpenMP implementations are based on dynamic schedulers. This paper proposes two OpenMP-compliant static allocation approaches: an optimal but costly approach based on an ILP formulation, and a sub-optimal but tractable approach that computes a worst-case makespan bound close to the optimal one. This work is funded by the EU projects P-SOCRATES (FP7-ICT-2013-10) and HERCULES (H2020/ICT/2015/688860), and the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P. Peer Reviewed. Postprint (author's final draft).
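To make the idea of a static allocation with a makespan figure concrete, here is a minimal sketch of a generic greedy list-scheduling heuristic over a task graph. It is not the paper's ILP formulation or its tractable approach; the function name `list_schedule`, the task names, and the WCET values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def list_schedule(wcet, deps, num_cores):
    """Statically map the tasks of a DAG onto cores (illustrative heuristic).

    wcet:      dict mapping task name -> assumed worst-case execution time
    deps:      list of (predecessor, successor) edges
    num_cores: number of identical cores
    Returns (makespan, schedule), where schedule[task] = (core, start, finish).
    """
    succs = defaultdict(list)
    pending = {t: 0 for t in wcet}          # unfinished predecessors per task
    for pred, succ in deps:
        succs[pred].append(succ)
        pending[succ] += 1

    ready_at = {t: 0 for t in wcet}         # earliest start allowed by predecessors
    ready = sorted(t for t in wcet if pending[t] == 0)
    core_free = [0] * num_cores             # time at which each core becomes free
    schedule = {}

    while ready:
        # Pick the ready task that can start earliest (ties broken by name).
        task = min(ready, key=lambda t: (ready_at[t], t))
        ready.remove(task)
        # Place it on the core that becomes free first.
        core = min(range(num_cores), key=lambda c: core_free[c])
        start = max(core_free[core], ready_at[task])
        finish = start + wcet[task]
        core_free[core] = finish
        schedule[task] = (core, start, finish)
        # Release successors whose predecessors have all been placed.
        for s in succs[task]:
            ready_at[s] = max(ready_at[s], finish)
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)

    makespan = max(finish for _, _, finish in schedule.values())
    return makespan, schedule


# Hypothetical diamond-shaped task graph: A -> {B, C} -> D.
wcet = {"A": 2, "B": 3, "C": 4, "D": 1}
deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
makespan, schedule = list_schedule(wcet, deps, num_cores=2)
print(makespan, schedule)
```

Because the mapping and start times are fixed before execution, the resulting makespan can be checked offline against a deadline. An exact formulation, such as the ILP the abstract mentions, would replace the greedy choices with an optimization over all mappings.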
Big Data Analytics for Smart Cities: The H2020 CLASS Project
Applying big-data technologies to field applications has resulted in several new needs. First, processing data across a compute continuum spanning from cloud to edge to devices, with varying capacity, architecture, etc. Second, some computations need to be made predictable (real-time response), thus supporting both data-in-motion processing and larger-scale data-at-rest processing. Last, employing an event-driven programming model that supports mixing different APIs and models, such as Map/Reduce, CEP, sequential code, etc. The research leading to these results has received funding from the European Union's Horizon 2020 Programme under the CLASS Project (www.class-project.eu), grant agreement No. 780622. Peer Reviewed. Postprint (author's final draft).
Improving performance guarantees in wormhole mesh NoC designs
Wormhole-based mesh Networks-on-Chip (wNoCs) are deployed in high-performance many-core processors due to their physical scalability and low cost. The distributed nature of wNoCs makes it challenging to deliver tight and time-composable Worst-Case Execution Time (WCET) estimates for applications, as needed in safety-critical real-time embedded systems. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight, time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms this. The research leading to these results is funded by the European Union Seventh Framework Programme under grant agreement no. 287519 (parMERASA) and by the Ministry of Science and Technology of Spain under contract TIN2012-34557. Milos Panic is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Peer Reviewed. Postprint (author's final draft).
Taskgraph: A Low Contention OpenMP Tasking Framework
OpenMP is the de-facto standard for shared memory systems in High-Performance
Computing (HPC). It includes a task-based model that offers a high level of
abstraction to effectively exploit highly dynamic structured and unstructured
parallelism in an easy and flexible way. Unfortunately, the run-time overheads
introduced to manage tasks are (very) high in most common OpenMP frameworks
(e.g., GCC, LLVM), which defeats the potential benefits of the tasking model,
and makes it suitable for coarse-grained tasks only. This paper presents
taskgraph, a framework that uses a task dependency graph (TDG) to represent a
region of code implemented with OpenMP tasks in order to reduce the run-time
overheads associated with the management of tasks, i.e., contention and
parallel orchestration, including task creation and synchronization. The TDG
avoids the overheads related to the resolution of task dependencies and greatly
reduces those deriving from the accesses to shared resources. Moreover, the
taskgraph framework introduces in OpenMP the record-and-replay execution model
that accelerates the taskgraph region from its second execution onward. Overall, the
multiple optimizations presented in this paper make it possible to exploit
fine-grained OpenMP tasks, in line with the trend in current applications toward
massive on-node parallelism and fine-grained, dynamic scheduling
paradigms. The framework is implemented on LLVM 15.0. Results show that the
taskgraph implementation outperforms the vanilla OpenMP system in terms of
performance and scalability, for both structured and unstructured parallelism,
and for both coarse- and fine-grained tasks. Furthermore, the proposed
framework considerably reduces the performance gap between the task and the
thread models of OpenMP.
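The record-and-replay idea can be illustrated with a small, language-neutral model. The sketch below is not the taskgraph framework or any OpenMP or LLVM API. The class `TaskGraph`, its methods, and the example tasks are hypothetical. It only shows the principle: dependencies are resolved once, from each task's declared reads and writes, and later executions walk the stored graph without resolving them again.

```python
from collections import deque


class TaskGraph:
    """Hypothetical record-and-replay container (illustrative only)."""

    def __init__(self):
        self.funcs = []        # task bodies, in creation order
        self.preds = []        # preds[i] = indices of tasks that task i waits for
        self.resolutions = 0   # number of dependency-resolution passes performed

    def record(self, tasks):
        """tasks: list of (func, reads, writes). Resolves dependencies once."""
        last_writer, readers = {}, {}
        for idx, (func, reads, writes) in enumerate(tasks):
            deps = set()
            for var in reads:                       # read-after-write
                if var in last_writer:
                    deps.add(last_writer[var])
            for var in writes:                      # write-after-write, write-after-read
                if var in last_writer:
                    deps.add(last_writer[var])
                deps.update(readers.get(var, ()))
            for var in reads:
                readers.setdefault(var, []).append(idx)
            for var in writes:
                last_writer[var] = idx
                readers[var] = []
            self.funcs.append(func)
            self.preds.append(sorted(deps))
        self.resolutions += 1

    def replay(self, state):
        """Run the recorded graph on `state` using only the stored edges."""
        remaining = [len(p) for p in self.preds]
        succs = [[] for _ in self.funcs]
        for i, ps in enumerate(self.preds):
            for p in ps:
                succs[p].append(i)
        ready = deque(i for i, n in enumerate(remaining) if n == 0)
        order = []
        while ready:
            i = ready.popleft()
            self.funcs[i](state)                    # serial here; a runtime would dispatch to threads
            order.append(i)
            for s in succs[i]:
                remaining[s] -= 1
                if remaining[s] == 0:
                    ready.append(s)
        return order


# Hypothetical region with four tasks: T0 -> {T1, T2} -> T3.
tasks = [
    (lambda s: s.update(a=1), [], ["a"]),
    (lambda s: s.update(b=s["a"] + 1), ["a"], ["b"]),
    (lambda s: s.update(c=s["a"] * 10), ["a"], ["c"]),
    (lambda s: s.update(d=s["b"] + s["c"]), ["b", "c"], ["d"]),
]
graph = TaskGraph()
graph.record(tasks)            # first execution: dependencies resolved once
first, second = {}, {}
graph.replay(first)
graph.replay(second)           # later executions reuse the stored graph
print(graph.preds, first, graph.resolutions)
```

In this model the cost of matching reads and writes is paid only in `record`, which is the kind of overhead the abstract says the TDG avoids on repeated executions of the same region.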
- …